Recommender systems are present in many web applications to guide our choices.
They increase sales and benefit sellers, but whether they benefit customers by
providing relevant products is questionable. Here we introduce a model to examine the
benefit of recommender systems for users, and find that recommendations from the
system can be equivalent to random draws if one relies too strongly on the system.
Nevertheless, with sufficient information about user preferences, recommendations
become accurate, and an abrupt transition to this accurate regime is observed for
some algorithms. On the other hand, we find that a high accuracy evaluated
by common accuracy metrics does not necessarily correspond to a high real accuracy
or to a benefit for users, which serves as a warning for operators and researchers
of recommender systems. We tested our model with a real dataset and observed
similar behaviors. Finally, a recommendation approach with improved accuracy is
suggested. These results imply that recommender systems can benefit users, but
relying too strongly on the system may render the system ineffective.


Introduction 


Almost all popular websites employ recommender systems to match users with items [1-4].
For instance, news websites analyze the reading history of individuals and recommend news
which matches their interests [5]; online social networks recommend new friends to individuals
based on their existing friends [6]. Most commonly, online retailers analyze the purchase
history of customers and recommend products to them to increase their own sales [7-9].
These examples show the increasingly crucial role of recommender systems in our daily life,
influencing our various choices.

Due to their broad applications, great efforts have been devoted to studying recommendation
algorithms and improving their accuracy [4]. Researchers in computer science, mathematics
and management science employ various mathematical tools such as Bayesian approaches and
matrix factorization to derive recommendation algorithms [4, 10-12]. Recently, physicists
and complex system scientists started to work in the area and incorporated physical processes
such as mass diffusion and heat conduction into recommender systems [13]. Nevertheless, the
main goal of these studies is limited to recommendation accuracy, and the genuine benefits
of recommender systems are less examined.

Although recommender systems have been shown to benefit retailers, whether the
recommended products are relevant to customers is questionable [9, 14]. On one hand, many
recommendation algorithms are based on product similarity and the recommended products
may be redundant since they are similar to the already purchased products [13]. On the
other hand, instead of specific products which match individual needs, many recommender
systems can only recommend popular but potentially irrelevant products [9, 15]. Nevertheless,
users may be tempted to purchase the products due to recommendations, and in this
case recommender systems benefit sellers but not customers.

In this paper, we introduce a simple model to examine the relevance between the
recommended products and the preferences of users. Unlike empirical studies where the true
user preference is unknown, each user in the model is characterized by a taste and the true
recommendation accuracy can be measured. We find that recommendations can be either
random or very accurate, depending on how frequently the users select a product without
recommendations. For some algorithms, an abrupt increase in accuracy is observed when
this frequency exceeds a threshold. On the other hand, we find that a high accuracy
indicated by common evaluation metrics does not necessarily imply a high real accuracy.
We tested our model using the MovieLens dataset [16] and observed similar behaviors.
Finally, we suggest a recommendation approach based on our findings which outperforms
conventional approaches.


Model 

Specifically, we consider a group of $N$ users selecting products from a group of $M$ items.
Each user $i$ and each item $\alpha$ is characterized by one of $G$ tastes or genres, denoted by $g_i$ and $g_\alpha$
respectively. For instance, in terms of movies, these tastes may correspond to science fiction,
romantic comedies or thrillers. The case where users have multiple tastes is described in
Section C.

At each time step, a user $i$ is randomly drawn. With a fraction $f_{\mathrm{sel}}$ of the times, user $i$
chooses a product matching his/her own taste without using the recommender system. This
is the conventional way to purchase a product and we call $f_{\mathrm{sel}}$ the frequency of deliberate
selection. On the other hand, with a fraction $1 - f_{\mathrm{sel}}$ of the times, user $i$ buys a product
following the recommender system. In both cases, a product in his/her collection is randomly
removed, since all products are assumed to be consumable and can be bought and consumed
more than once. As a result, the total number of products collected by user $i$ remains
constant at $k_i$, which simplifies our model as network growth is not required and $N$ and $M$
remain constant. The above procedures are repeated a large number of times per user.
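The update rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulation code behind the figures: the function name `update_user`, the `recommend` callback and the returned `via` label are hypothetical, and two details are assumptions not stated in the text, namely that a deliberately selected product is one the user does not yet hold, and that the old product is removed before the new one is added.

```python
import random

def update_user(collection, user_taste, item_taste, f_sel, recommend, rng=random):
    """One update of a single user (illustrative sketch).

    collection : list of item ids currently held by the user (length k_i)
    user_taste : taste g_i of the user
    item_taste : list mapping item id -> genre g_alpha
    recommend  : hypothetical callback returning an item id not in `collection`
    """
    if rng.random() < f_sel:
        # Deliberate selection: an item of the matching taste
        # (assumed here to be one the user does not already hold).
        candidates = [a for a, g in enumerate(item_taste)
                      if g == user_taste and a not in collection]
        new_item = rng.choice(candidates) if candidates else None
        via = "selection"
    else:
        # Otherwise the user follows the recommender system.
        new_item = recommend(collection)
        via = "recommendation"
    if new_item is None:
        return None, via
    # A randomly chosen held item is consumed, so k_i stays constant.
    collection.pop(rng.randrange(len(collection)))
    collection.append(new_item)
    return new_item, via
```

Recording whether each product arrived via selection or recommendation is not needed for the basic model, but it is the information that the biased algorithm of Section E relies on.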

We remark that the recommender system has no direct knowledge of user taste and
product genre; it can only infer user preferences through the purchase history. Since $f_{\mathrm{sel}}$
is the frequency with which a user makes purchases in the absence of recommender systems, on average
at least a fraction $f_{\mathrm{sel}}$ of the purchases of user $i$ must match his/her taste; $f_{\mathrm{sel}}$ is thus proportional
to the amount of available hints the recommender system can exploit. We further define the
recommendation accuracy $A_{\mathrm{rec}}$ to be the fraction of recommended products which match
the taste of the user, and our goal is to examine $A_{\mathrm{rec}}$ to reveal the benefit of recommender
systems to users.

For simplicity, we employ the common Item-based Collaborative Filtering (ICF) [17] as
the recommendation algorithm in our model. ICF provides personalized recommendations
to users by computing the similarity between their purchased products and other products.
We first denote the similarity between items $\alpha$ and $\beta$ at time $t$ by $s_{\alpha\beta}(t)$. As shown
by previous studies [17], the performance of the algorithm is strongly dependent on the
definition of similarity. To show that our results are relevant to different recommendation
algorithms, we will employ two definitions of similarity, namely the common neighbor (CN)
similarity, given by

$$ s^{(\mathrm{CN})}_{\alpha\beta}(t) = \sum_{i=1}^{N} a_{i\alpha}(t)\, a_{i\beta}(t), \tag{1} $$

and the cosine similarity [17], given by

$$ s^{(\cos)}_{\alpha\beta}(t) = \frac{1}{\sqrt{k_\alpha k_\beta}} \sum_{i=1}^{N} a_{i\alpha}(t)\, a_{i\beta}(t), \tag{2} $$

where $k_\alpha = \sum_i a_{i\alpha}(t)$ is the number of users who hold item $\alpha$.
The adjacency variable $a_{i\alpha}(t) = 1$ if item $\alpha$ is collected by user $i$ at time $t$, and otherwise
$a_{i\alpha}(t) = 0$. The recommendation score $r_{i\alpha}(t)$ of product $\alpha$ for user $i$ at time $t$ is given by

$$ r_{i\alpha}(t) = \sum_{\beta=1}^{M} a_{i\beta}(t)\, s_{\alpha\beta}(t) = \sum_{\beta \in C_i(t)} s_{\alpha\beta}(t), \tag{3} $$

where $C_i(t)$ is the set of products collected by user $i$ at time $t$. Finally, the product with
the highest score not yet collected by the user is recommended.
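As a concrete illustration of Eqs. (1)-(3), the following pure-Python sketch computes both similarities from a binary user-item matrix and returns the top-scoring uncollected item. The function names `similarity` and `recommend` are hypothetical, and breaking ties in favor of the lowest item index is an assumption; the text does not say how ties are resolved.

```python
from math import sqrt

def similarity(a, kind="CN"):
    """Item-item similarity from a binary user-item matrix a[i][alpha]."""
    n_users, n_items = len(a), len(a[0])
    # k_alpha: number of users holding each item.
    degree = [sum(a[i][x] for i in range(n_users)) for x in range(n_items)]
    s = [[0.0] * n_items for _ in range(n_items)]
    for x in range(n_items):
        for y in range(n_items):
            # Number of users holding both items (common neighbors).
            common = sum(a[i][x] * a[i][y] for i in range(n_users))
            if kind == "cosine":
                norm = sqrt(degree[x] * degree[y])
                s[x][y] = common / norm if norm > 0 else 0.0
            else:
                s[x][y] = float(common)
    return s

def recommend(a, s, i):
    """Return the uncollected item with the highest score for user i."""
    n_items = len(s)
    collected = [b for b in range(n_items) if a[i][b]]
    # Score of each uncollected item: sum of similarities to held items.
    scores = {x: sum(s[x][b] for b in collected)
              for x in range(n_items) if not a[i][x]}
    # max() keeps the first maximum, i.e. the lowest index on ties (assumption).
    return max(scores, key=scores.get) if scores else None
```

For example, with `a = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 0]]`, the last user holds only item 0 and is recommended item 1, which shares two users with item 0.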

Results 

A. Random versus accurate recommendations 

To examine the benefit of recommender systems to users, we first study the dependence
of the recommendation accuracy $A_{\mathrm{rec}}$ on the frequency $f_{\mathrm{sel}}$ of deliberate selection. The higher
the value of $f_{\mathrm{sel}}$, the more often the user chooses a product of a matching taste without
recommendation, and the more information there is for the recommender system to exploit. If
recommender systems work perfectly, $A_{\mathrm{rec}} = 100\% = 1$ whenever $f_{\mathrm{sel}} > 0$, as there exists
non-zero information about user tastes in the dataset; on the other hand, if recommender systems
do not work at all, recommendations are always random, and $A_{\mathrm{rec}} = 1/G$ independent of
$f_{\mathrm{sel}}$.

As shown in Fig. 1, the recommendation accuracy falls between the two extreme cases.
The common neighbor similarity is employed in Fig. 1(a), and $A_{\mathrm{rec}} \approx 1/G$, which corresponds
to the case of random recommendations, when $f_{\mathrm{sel}}$ is less than a threshold. When $f_{\mathrm{sel}}$ increases
beyond the threshold, the recommendation accuracy increases abruptly to $A_{\mathrm{rec}} = 1$, which
corresponds to the case of perfect recommendation. In Fig. 1(b), cosine similarity is
employed and a similar dependence of $A_{\mathrm{rec}}$ on $f_{\mathrm{sel}}$ is observed, though the transition between
the two phases is more gentle. We remark that $A_{\mathrm{rec}} = 1$ is an artifact of the model, since each
user and product is categorized by only one taste: once users and products of the same
taste have formed an isolated bipartite cluster, only products within the cluster are recommended,
which leads to a persistent perfect accuracy.

The accuracy $A_{\mathrm{rec}}$ also depends on the number of taste groups $G$. Intuitively, one would expect
the threshold value for perfect recommendation to fall as $G$ decreases, since it seems easier to
identify an item with the correct taste out of a smaller number of taste groups. However,
simulated results in both Fig. 1(a) and (b) show that the threshold value increases when $G$
decreases. This is because users collect products of both relevant and irrelevant tastes; when $G$
is small, the irrelevant products belong to a small number of taste groups, and there exists
a strong connection between users and each irrelevant taste group, making it difficult for
the recommender system to identify these false connections. In short, the more diverse and
distinct the users and products, the fewer hints are required to provide correct
recommendations.

FIG. 1: The accuracy $A_{\mathrm{rec}}$ of the recommender system as a function of $f_{\mathrm{sel}}$ for different numbers of
taste groups $G$. The simulation results were obtained with $N = 2000$ users and $M = 100$ products.
Each user collects $k = 7$ products and is updated 1 x 10^[exponent illegible] times. Each data point was averaged
over 50 instances. The common neighbor similarity Eq. (1) and the cosine similarity Eq. (2) were
employed in (a) and (b) respectively.

Other than the number of taste groups, the recommendation accuracy also depends on the
number of items collected by each user. For simplicity, all users collect the same number
of items, i.e. $k_i = k$ for all $i$. As shown in Fig. 2(a) and (b), perfect recommendation is
more difficult to achieve for larger $k$, where the stronger connection between
users and irrelevant taste groups is again the reason. These results imply that when users
collect a large number of products, false connections exist and may have a negative impact on
the recommender system. Hence, instead of drawing recommendations based on all the
available data, an algorithm which effectively eliminates the false connections may lead to
a high recommendation accuracy.

FIG. 2: The accuracy $A_{\mathrm{rec}}$ of the recommender system as a function of $f_{\mathrm{sel}}$ for different values of $k$,
the number of products collected per user. The simulation results were obtained with $N = 2000$,
$M = 100$ and $G = 10$. Each user was updated 1 x 10^[exponent illegible] times, and each data point was averaged
over 50 instances. The common neighbor similarity Eq. (1) and the cosine similarity Eq. (2) were
employed in (a) and (b) respectively.

The above results suggest that recommender systems may provide irrelevant
recommendations when users do not provide sufficient hints about their taste. On the other hand,
given sufficient hints, recommender systems make good use of the information to match users with
products. The amount of hints required for accurate recommendation differs between
algorithms and systems.


B. Estimated accuracy versus real accuracy 


In real systems, since the real preference of users is unknown, there is no way to measure
the real recommendation accuracy. Various metrics are thus introduced to evaluate
recommendation accuracy. Nevertheless, whether these metrics correctly measure real accuracy is
questionable. Since user taste and product genre are defined in our model, we can compare
the accuracy measured by these metrics with the real accuracy.

One common metric for evaluating recommendation accuracy is the AUC, i.e. the area under
the receiver operating characteristic (ROC) curve. When recommendations are made for user $i$, the AUC is
computed as the probability that a correct product $\alpha$ is ranked higher than an arbitrary
product $\gamma$, given by

$$ \mathrm{AUC}_i = \frac{n(r_{i\gamma} < r_{i\alpha}) + 0.5\, n(r_{i\gamma} = r_{i\alpha})}{M - k_i}, \tag{4} $$

where $n(r_{i\gamma} < r_{i\alpha})$ is the number of products with a score lower than the score of
the correct product, and $n(r_{i\gamma} = r_{i\alpha})$ is the number of items which tie with the correct
item. Depending on the definition of correct predictions, we compute two AUC measures: (i)
the conventional estimated $\mathrm{AUC}_{\mathrm{est}}$, obtained by dividing the dataset into a training set and
a probe set; links in the probe set are removed and are considered to be correct predictions
if their existence is predicted; and (ii) the real $\mathrm{AUC}_{\mathrm{real}}$, which quantifies the accuracy of
the algorithm in recommending products of a matching taste.
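Eq. (4) for a single correct product can be sketched as follows. The function name `auc_for_user` is hypothetical, and the sketch takes the denominator $M - k_i$ to be the number of uncollected products competing with the correct one, which is an interpretation rather than something the text spells out. The same function serves for both measures: for the estimated AUC the correct product is a removed probe link, and for the real AUC it is an uncollected product of the user's taste.

```python
def auc_for_user(scores, correct, collected):
    """AUC_i for one correct item, following the form of Eq. (4).

    scores    : list of recommendation scores r_{i,gamma}, one per item
    correct   : index alpha of the item regarded as correct
    collected : set of items already held by the user (not ranked)
    """
    # Competing items: everything the user does not hold, except alpha itself.
    others = [g for g in range(len(scores))
              if g not in collected and g != correct]
    if not others:
        raise ValueError("no competing items to rank against")
    lower = sum(1 for g in others if scores[g] < scores[correct])
    ties = sum(1 for g in others if scores[g] == scores[correct])
    # Ties count one half, so a fully random ranking gives 0.5 on average.
    return (lower + 0.5 * ties) / len(others)
```

When all scores are equal the function returns 0.5, the value expected for random recommendations.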

The dependence of $\mathrm{AUC}_{\mathrm{est}}$ and $\mathrm{AUC}_{\mathrm{real}}$ on $f_{\mathrm{sel}}$ is shown in Fig. 3. As we can see,
$\mathrm{AUC}_{\mathrm{real}} \approx 0.5$ when $f_{\mathrm{sel}}$ is small, since recommendations are random (see Fig. 1) and the
products of a matching taste are randomly ranked in the recommendation list. However,
$\mathrm{AUC}_{\mathrm{est}}$ is much higher and is not consistent with $\mathrm{AUC}_{\mathrm{real}}$. The reason for a large $\mathrm{AUC}_{\mathrm{est}}$
at small $f_{\mathrm{sel}}$ is the frequent application of recommender systems, such that user purchases
are strongly influenced by the algorithms regardless of their true preference. In this case,
products which do not match their preference but are consistent with the algorithms are
also collected by the users. This favors the evaluation by $\mathrm{AUC}_{\mathrm{est}}$ using a random probe set,
and leads to a high $\mathrm{AUC}_{\mathrm{est}}$ even when the recommendations provided are in fact random.

FIG. 3: The two different AUC measures, $\mathrm{AUC}_{\mathrm{est}}$ and $\mathrm{AUC}_{\mathrm{real}}$, as a function of $f_{\mathrm{sel}}$, obtained
by ICF with common neighbor similarity and cosine similarity (inset) on systems with $N = 2000$,
$M = 100$, $k$ = [value illegible] and $G = 10$.

When $f_{\mathrm{sel}}$ increases, $\mathrm{AUC}_{\mathrm{est}}$ decreases since the user-product relations become less
influenced by the recommender system. At the same time, $\mathrm{AUC}_{\mathrm{real}}$ increases since more hints
about the user tastes are present. We remark that although $A_{\mathrm{rec}} \approx 1/G$ when $f_{\mathrm{sel}}$ is smaller
than the threshold (see Fig. 1(a)), the corresponding $\mathrm{AUC}_{\mathrm{real}}$ is increasing in the same
regime. Finally, $\mathrm{AUC}_{\mathrm{real}}$ and $\mathrm{AUC}_{\mathrm{est}}$ become consistent when $f_{\mathrm{sel}}$ further increases and the
system achieves perfect recommendation.

The above results imply that the conventional evaluation of recommendation accuracy
may not necessarily reflect the true accuracy. Indeed, $\mathrm{AUC}_{\mathrm{est}}$ may over-estimate the
accuracy of the algorithm, especially in cases where users rely frequently on the recommender
system and do not reveal their own taste by deliberately selecting products. This serves as
a warning for researchers and operators of recommender systems. Alternative evaluations are
therefore necessary to supplement conventional accuracy metrics to quantify the benefit of
recommender systems for users.

FIG. 4: The fraction $A_{\mathrm{rec}}$ of recommended items in taste 1 as a function of $f_1$, the fraction of the
selected products in taste 1. The simulations are obtained with $N = 2000$, $M = 100$, $G = 10$ and
$f_{\mathrm{sel}} = 0.95$ for 5 x 10^[exponent illegible] updates, averaged over 50 instances. Only results obtained with common
neighbor similarity are shown.


C. Users with multiple tastes 

Ordinary users usually have more than one interest; for instance, a user may be interested
in both science fiction and action movies. To model this scenario, we assume that each user
is characterized by two tastes, which we denote by taste 1 and taste 2. Similar to the previous
case, with a fraction $f_{\mathrm{sel}}$ of the times, the user selects a product in the absence of recommender systems;
otherwise, the recommendation algorithm is applied. When a user selects products, a fraction $f_1$ of
the selected products are in taste 1 and the rest are in taste 2. To simplify the model,
we only study cases with large $f_{\mathrm{sel}}$, with which perfect recommendation is achieved in the
original single-taste system.

Since a fraction $f_1$ of the selected products of the user are in taste 1, the ratio $f_1/(1 - f_1)$
corresponds to his/her preference between the two tastes. If optimal recommendations are
achieved, a fraction $f_1$ of the recommended products should be in taste 1 and $1 - f_1$ of them should be
in taste 2. Nevertheless, as shown in Fig. 4, the fraction $A_{\mathrm{rec}}$ of the recommended products
in taste 1 does not coincide with the optimal line $A_{\mathrm{rec}} = f_1$. For instance, when $f_1$ is small,
the recommendations are mainly in taste 2. This leads to a sub-optimal state which
under-represents the minority taste, i.e. taste 1 when $f_1 < 0.5$, among the recommended products.
Similarly, taste 2 is under-represented when $f_1 > 0.5$. As we can see in Fig. 4, the difference
between $A_{\mathrm{rec}}$ and $f_1$ is larger when $k$ is larger. This implies an increasing difficulty for
the recommender system to identify a secondary taste if the user-product connections are
denser. We remark that the results obtained with the common neighbor similarity and the
cosine similarity are almost identical.

On the other hand, one may expect a perfect recommendation regime at $f_1 f_{\mathrm{sel}} > f^{*}_{\mathrm{sel}}$,
where $f^{*}_{\mathrm{sel}}$ denotes the threshold value, or equivalently the smallest $f_{\mathrm{sel}}$ at which the system
achieves perfect recommendation in the corresponding single-taste scenario. For the system
parameters employed in Fig. 4, $f^{*}_{\mathrm{sel}} \approx 0.73$, but perfect recommendations in taste 1 are
not achieved with $f_1 > f^{*}_{\mathrm{sel}}/f_{\mathrm{sel}} = 0.77$ (indicated by the dotted line in Fig. 4) due to the
presence of taste 2.
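The value 0.77 can be checked directly, taking $f_{\mathrm{sel}} = 0.95$ from the caption of Fig. 4. Deliberate selections in taste 1 occur with frequency $f_1 f_{\mathrm{sel}}$, so the naive single-taste condition reads:

```latex
f_1 f_{\mathrm{sel}} > f^{*}_{\mathrm{sel}}
\quad\Longrightarrow\quad
f_1 > \frac{f^{*}_{\mathrm{sel}}}{f_{\mathrm{sel}}}
    = \frac{0.73}{0.95} \approx 0.77 .
```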


D. Tests with empirical datasets 


Finally, we combine our model with a real dataset obtained from MovieLens [16].
Since user taste and product genre are unknown in real systems, we again randomly divide
the dataset into a training set and a probe set, and consider a recommended movie to
be correct only if it was collected by the user and received a rating of at least 3 (on a scale from 1
to 5) from the user as recorded in the data. Similar to our model, with a fraction $f_{\mathrm{sel}}$ of the times,
a user deliberately selects a correct movie, and otherwise the recommendation algorithm is
applied. For those users who rated at least two movies with a score of 3 or above, we set
their degree to be $k_i - 1$ such that an un-collected correct movie always exists. As in the
previous simulations, a user randomly removes one of his/her collected movies when he/she
obtains a new movie; the system is then repeatedly updated.
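The notion of a "correct" movie used here can be sketched as below. The function names and the `(user, movie, rating)` triple format are hypothetical, and the sketch assumes that $k_i$ denotes the number of correct movies of user $i$, so that a degree of $k_i - 1$ always leaves one correct movie uncollected; the text does not state this explicitly.

```python
def correct_sets(ratings, min_rating=3):
    """Group movies rated at least `min_rating` by user.

    ratings : iterable of (user, movie, rating) triples, rating on a 1-5 scale
    """
    correct = {}
    for user, movie, rating in ratings:
        if rating >= min_rating:
            correct.setdefault(user, set()).add(movie)
    return correct

def fixed_degrees(correct, min_correct=2):
    """Degree k_i - 1 for users with at least `min_correct` correct movies.

    k_i is taken to be the number of correct movies (an assumption).
    """
    return {user: len(movies) - 1
            for user, movies in correct.items()
            if len(movies) >= min_correct}
```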

As shown in Fig. 5(a), the accuracy $A_{\mathrm{rec}}$ obtained with both similarity definitions starts at
a low value and increases with $f_{\mathrm{sel}}$. Unlike the previous simulations, it does not jump abruptly
to a high value, although a plateau at small $f_{\mathrm{sel}}$ and a small jump at large $f_{\mathrm{sel}}$
are observed in the case with cosine similarity. These results again suggest that sufficient
hints about user taste are essential for the system to obtain accurate recommendations.
When $f_{\mathrm{sel}}$ approaches 1, $A_{\mathrm{rec}}$ decreases, since users have collected most of the correct movies
through deliberate selection and it becomes more difficult for the recommender system to
identify the fewer remaining correct items among all the other items.

As shown in Fig. 5(b), the dependence of $\mathrm{AUC}_{\mathrm{est}}$ and $\mathrm{AUC}_{\mathrm{real}}$ on $f_{\mathrm{sel}}$ is similar to that
observed in the previous simulations. When $f_{\mathrm{sel}}$ is small, the conventional AUC metric
over-estimates the accuracy of the recommender system. In particular, $\mathrm{AUC}_{\mathrm{est}}$ is highest when
$\mathrm{AUC}_{\mathrm{real}}$ is lowest, and $\mathrm{AUC}_{\mathrm{est}} = \mathrm{AUC}_{\mathrm{real}}$ only when $f_{\mathrm{sel}} = 1$. This suggests that conventional
metrics may again over-estimate recommendation accuracy in real systems.

FIG. 5: (a) The recommendation accuracy $A_{\mathrm{rec}}$ as a function of $f_{\mathrm{sel}}$, obtained by combining our
model with the MovieLens dataset with 944 users and 1683 products, and 5000 updates per user.
(b) The corresponding estimated $\mathrm{AUC}_{\mathrm{est}}$ and the real $\mathrm{AUC}_{\mathrm{real}}$ as a function of $f_{\mathrm{sel}}$.

E. A recommendation algorithm with improvement 

Based on the previous results, we slightly modify the ICF algorithm to improve the
recommendation accuracy. The rationale is simple: since products deliberately selected by
users usually match their taste, we give a higher weight to these products during
the computation of recommendation scores, by modifying the adjacency variable $a_{i\alpha}(t)$ as
follows:

$$ a_{i\alpha}(t) = \begin{cases} 0 & \text{if } \alpha \notin C_i(t), \\ 1 & \text{if } \alpha \in C_i(t) \text{ via recommendation}, \\ b & \text{if } \alpha \in C_i(t) \text{ via selection}, \end{cases} \tag{5} $$

where $C_i(t)$ is again the set of products collected by user $i$ at time $t$, and $b > 1$ is the bias
on products collected via deliberate selection. The recommendation score of an item is
then computed by the same formula, Eq. (3). The recommendation accuracy obtained by
the modified algorithm is compared to that of the original algorithm in Fig. 6. As we can
see from Fig. 6(a) and (b), perfect recommendations are achieved at a smaller $f_{\mathrm{sel}}$ when
selected products are weighted more heavily in the algorithm. Similar results are observed with
the MovieLens dataset, as shown in Fig. 6(c) and (d). These results imply that products
deliberately chosen by users are essential information to improve recommendation accuracy.

FIG. 6: The accuracy $A_{\mathrm{rec}}$ of the original ICF compared with ICF biased toward products collected via
deliberate selection (with $b = 2$ in Eq. (5)). The results are obtained by (a) the common neighbor
(CN) similarity and (b) the cosine similarity on generated networks with $N = 2000$, $M = 100$,
$k = 7$ and $G = 10$. The corresponding results on the MovieLens dataset are shown in (c) and (d).
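A minimal sketch of the weighting in Eq. (5) follows. The names `biased_adjacency`, `biased_score` and the `sources` bookkeeping are hypothetical. The text does not specify whether the similarity itself is also recomputed from the weighted values; this sketch applies the weights only when summing similarities into the score, which is an assumption.

```python
def biased_adjacency(collections, sources, n_items, b=2.0):
    """Weighted adjacency a[i][alpha] in the spirit of Eq. (5).

    collections : collections[i] is the list of items held by user i
    sources     : sources[i][alpha] is "selection" or "recommendation"
    b           : bias (> 1) on deliberately selected items
    """
    a = [[0.0] * n_items for _ in collections]
    for i, items in enumerate(collections):
        for alpha in items:
            # Selected items get weight b, recommended items weight 1.
            a[i][alpha] = b if sources[i].get(alpha) == "selection" else 1.0
    return a

def biased_score(a, s, i, alpha):
    """Score of item alpha for user i: sum over beta of a[i][beta] * s[alpha][beta]."""
    return sum(a[i][beta] * s[alpha][beta] for beta in range(len(s)))
```

With `b = 2`, an item similar to a deliberately selected product contributes twice as much to the score as an equally similar item reached through a recommended product.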
Discussion

To reveal the benefit of recommender systems for users, we studied a simple model where
users either choose their own products or follow the recommendations from the system. Our
results show that the recommendations may be equivalent to random draws if users rely
too strongly on the recommender system and do not reveal their own taste by deliberately
selecting products. On the other hand, if sufficient information about their taste is present,
recommender systems are able to achieve high accuracy in matching appropriate products
to users. For some recommendation algorithms, the increase in accuracy is abrupt once the
amount of available information exceeds a threshold. These results imply that recommender
systems can benefit users, but relying too strongly on the system may render the system
ineffective.

On the other hand, our study reveals the difficulty of obtaining a realistic and accurate
evaluation of recommendation accuracy. Since real user preference is unknown, the evaluation of
recommender algorithms usually involves removing a set of existing data and quantifying the
accuracy of an algorithm by its success in retrieving the removed set. Our results show that such metrics
do not necessarily reflect, and may over-estimate, the true accuracy of the algorithm. This
is because the choice of products collected by users was previously influenced by the
recommendation algorithms; the presence of these products may not reflect their true preference
and may favor the evaluation by the conventional accuracy metrics. The disagreement
between the estimated and the real accuracy was observed in simulations with both generated
networks and a real dataset. These results imply that a high recommendation accuracy
indicated by the conventional metrics may not necessarily imply a benefit for users. Alternative
evaluations are necessary to supplement these metrics in order to quantify the effectiveness
of recommender systems.
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Recommender systems are present in many web applications to guide our choices. 
They increase sales and benefit sellers, but whether they benefit customers by pro¬ 
viding relevant products is questionable. Here we introduce a model to examine the 
benefit of recommender systems for users, and found that recommendations from the 
system can be equivalent to random draws if one relies too strongly on the system. 
Nevertheless, with sufficient information about user preferences, recommendations 
become accurate and an abrupt transition to this accurate regime is observed for 
some algorithms. On the other hand, we found that a high accuracy evaluated 
by common accuracy metrics does not necessarily correspond to a high real accu¬ 
racy nor a benefit for users, which serves as an alarm for operators and researchers 
of recommender systems. We tested our model with a real dataset and observed 
similar behaviors. Finally, a recommendation approach with improved accuracy is 
suggested. These results imply that recommender systems can benefit users, but 
relying too strongly on the system may render the system ineffective. 



Introduction 


Almost all popular websites employ recommender systems to match users with items [1-4], 
For instance, news websites analyze the reading history of individnals and recommend news 
which match their interests [5]; online social networks recommend new friends to individnals 
based on their existing friends [6]. Most commonly, online retailers analyze the purchase 
history of customers and recommend products to them to increase their own sales [7-9]. 
These examples show an increasingly crucial role of recommender systems in our daily life, 
influencing onr varions choices. 

Dne to their broad applications, great efforts have been devoted to stndy recommendation 
algorithms and to improve their accnracy [4]. Researchers in compnter science, mathematics 
and management science employ various mathematical tools such as Bayesian approach and 
matrix factorization to derive recommendation algorithms [4, 10-12]. Recently, physicists 
and complex system scientists started to work in the area and incorporated physical processes 
snch as mass diffnsion and heat condnction to recommender system [13]. Nevertheless, the 
main goal of these stndies is limited to recommendation accnracy, bnt their gennine benefits 
are less examined. 

Although recommender systems have been shown to benefit retailers, whether the rec¬ 
ommended products are relevant to customers is questionable [9, 14]. On one hand, many 
recommendation algorithms are based on product similarity and the recommended products 
may be redundant since they are similar to the already pnrchased prodncts [13]. On the 
other hand, instead of specific prodncts which match individnal needs, many recommender 
systems can only recommend popnlar bnt potentially irrelevant prodncts [9, 15]. Neverthe¬ 
less, users may be tempted to purchase the products due to recommendations, and in this 
case recommender systems benefit sellers but not customers. 

In this paper, we introduce a simple model to examine the relevance between the rec¬ 
ommended prodncts and the preferences of nsers. Unlike empirical stndies where the trne 
nser preference is nnknown, each nser in the model is characterized by a taste and the trne 
recommendation accnracy can be measnred. We fonnd that recommendations can be either 
random or very accurate depending on the frequency the users select a product without 
recommendations. For some algorithms, an abrupt increase in accuracy is observed when 
this freqnency exceeds a threshold. On the other hand, we fonnd that a high accnracy in- 



dicated by common evaluation metrics does not necessarily imply to a high real accuracy. 
We tested our model using the MovieLens dataset [16] and observed similar behaviors. Fi¬ 
nally, a recommendation approach based on our findings was suggested which outperforms 
conventional approaches. 


Model 

Specifically, we consider a group of N users selecting products from a group of M items. 
Each user i and item a is characterized by one of the G tastes or genres, denoted by Qi and 
respectively. For instance, in terms of movies, these tastes may correspond to science fictions, 
romantic comedies or thrillers. The case where users have multiple tastes are described in 
Section C. 

At each time step, a user i is randomly drawn. With a fraction /sei of the times, user i 
chooses a product matching his/her own taste without using the recommender system. This 
is the conventional way to purchase a product and we call /sei the frequency of deliberate 
selection. On the other hand, with a fraction 1 — /sei of the times, user i buys a product 
following the recommender system. In both cases, a product in his/her collection is randomly 
removed since all products are assumed to be consumable and can be brought and consumed 
for more than once. In this case, the total number of products collected by user i remains 
constant at ki, which simplifies our model as network growth is not required and N and M 
remain constant. The above procedures are repeated for a large number of times per user. 

We remark that the recommender system has no direct knowledge of user taste and 
product genre, it can only infer user preferences through his/her purchase history. Since /sei 
is the frequency a user makes purchases in the absence of recommender systems, on average 
at least /gei of the purchases of user i must match his/her taste; /sei is thus proportional 
to the amount of available hints the recommender systems can exploit. We further define 
recommendation accuracy A^-^c to be the fraction of recommended products which match 
the taste of the user, and our goal is to examine Arec to reveal the benefit of recommender 
systems to users. 

For simplicity, we employ the common Item-based Collaborative Filtering (IGF) [17] to be 
the recommendation algorithm in our model. IGF provides personalized recommendations 
to users by computing similarity between their purchased products with other products. 



We first denote the similarity between item a and /5 at time t to be Sapit). As shown 
by previous studies [17], the performance of the algorithm is strongly dependent on the 
definition of similarity. To shown that our results are relevant to different recommendation 
algorithms, we will employ two definitions of similarity, namely the common neighbor (CN) 
similarity^ given by 

N 

(i), (1) 

1=0 

and the cosine similarity [17], given by 


= a.. (*)«.,(*). 


( 2 ) 


The adjacency variable aia{t) = 1 if item a is collected by user i at time t, and otherwise 
o-ia{t) = 0. The recommendation score of product a for user i at time t is given by 

M 

0,ip(t) Soipit) ^ ^ ■^ap(j'')j (3) 


3=1 




where Ci{t) is the set of products collected by user i at time t. Finally, the product with 
the highest score not yet collected by the user is recommended. 


Results 

A. Random versus accurate recommendations 

To examine the benefit of recommender system to users, we first study the dependence 
of recommendation accuracy A^ec on the frequency /sei of deliberate selection. The higher 
the value of fsei, the more often the user chooses a product of a matching taste without 
recommendation, and the more the information for the recommender system to exploit. If 
recommender systems work perfectly, Arec = 100% = 1 whenever /sei > 0 as there exists non¬ 
zero information about user tastes in the dataset; on the other hand, if recommender systems 
do not work at all, recommendations are always random, and Arec = 1/G independent of 

/sei- 

As shown in Fig. 1, the recommendation accuracy falls between the two extreme cases. 
The common neighbor similarity is employed in Fig. 1(a), where $A_{\mathrm{rec}} \approx 1/G$, which corresponds 
to the case of random recommendations, when $f_{\mathrm{sel}}$ is less than a threshold. When $f_{\mathrm{sel}}$ increases 
beyond the threshold, the recommendation accuracy increases abruptly to $A_{\mathrm{rec}} = 1$, which 
corresponds to the case of perfect recommendation. In Fig. 1(b), cosine similarity is 
employed and a similar dependence of $A_{\mathrm{rec}}$ on $f_{\mathrm{sel}}$ is observed, though the transition between 
the two phases is more gradual. We remark that $A_{\mathrm{rec}} = 1$ is an artifact of the model: each 
user and product is categorized by only one taste, so once users and products of the same 
taste form an isolated bipartite cluster, only products within the cluster are recommended, 
which leads to a persistent perfect accuracy. 

The accuracy $A_{\mathrm{rec}}$ also depends on the number of taste groups $G$. Intuitively, one would 
expect the threshold value for perfect recommendation to decrease as $G$ decreases, since it 
seems easier to identify an item with the correct taste out of a smaller number of taste groups. 
However, simulated results in both Fig. 1(a) and (b) show that the threshold value increases when $G$ 
decreases. This is because users collect products of both relevant and irrelevant tastes; when $G$ 
is small, the irrelevant products belong to a small number of taste groups, and there exists 
a strong connection between users and each irrelevant taste group, making it difficult for 
the recommender system to identify these false connections. In short, the more diverse and 
distinct the users and products, the fewer hints are required to provide correct 
recommendations. 

Other than the number of taste groups, recommendation accuracy also depends on the 
number of items collected by each user. For simplicity, all users collect the same number 
of items, i.e. $k_i = k$ for all $i$. As shown in Fig. 2(a) and (b), perfect recommendation is 
more difficult to achieve for larger $k$, where the stronger connection between 
users and irrelevant taste groups is again the reason. These results imply that when users 
collect a large number of products, false connections exist and may negatively affect 
the recommender system. Hence, instead of drawing recommendations based on all the 
available data, an algorithm which effectively eliminates the false connections may lead to 
a high recommendation accuracy. 

The above results suggest that recommender systems may provide irrelevant recommendations 
when users do not provide sufficient hints about their taste. On the other hand, 
given sufficient hints, recommender systems make good use of the information to match users with 
products. The amount of hints required for accurate recommendation differs between 
algorithms and systems. 
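
The dependence of $A_{\mathrm{rec}}$ on $f_{\mathrm{sel}}$ discussed in this subsection can be explored with a small simulation. The sketch below is only an illustration under stated assumptions, not the code used for the figures: it assumes random initial collections, genres of equal size, random tie-breaking, the common neighbor similarity, and accuracy measured over all recommendation events. The function name `simulate_accuracy` and its default parameters are hypothetical.

```python
import numpy as np

def simulate_accuracy(N=200, M=50, G=5, k=3, f_sel=0.5, steps=20000, seed=0):
    """Fraction of recommended items whose genre matches the user's taste."""
    if k >= M // G:
        raise ValueError("k must be smaller than the number of items per genre")
    rng = np.random.default_rng(seed)
    user_taste = rng.integers(G, size=N)   # one taste per user
    item_genre = np.arange(M) % G          # one genre per item (equal sizes: assumption)
    a = np.zeros((N, M))
    for i in range(N):                     # random initial collection (assumption)
        a[i, rng.choice(M, size=k, replace=False)] = 1.0
    hits = trials = 0
    for _ in range(steps):
        i = rng.integers(N)
        free = np.flatnonzero(a[i] == 0)   # items not yet collected by user i
        if rng.random() < f_sel:
            # Deliberate selection: a random uncollected item of matching taste.
            new = rng.choice(free[item_genre[free] == user_taste[i]])
        else:
            # Recommendation: CN-based scores, computed as a^T (a a_i).
            scores = a.T @ (a @ a[i])
            best = free[scores[free] == scores[free].max()]
            new = rng.choice(best)         # random tie-breaking (assumption)
            trials += 1
            hits += int(item_genre[new] == user_taste[i])
        # Keep the degree fixed: drop one random old item, then add the new one.
        old = rng.choice(np.flatnonzero(a[i] == 1))
        a[i, old] = 0.0
        a[i, new] = 1.0
    return hits / trials if trials else float("nan")
```

Sweeping `f_sel` from 0 to 1 for several values of `G` or `k` gives curves that can be compared qualitatively with Figs. 1 and 2; the parameter values and run lengths here are much smaller than those quoted in the captions, so quantitative agreement should not be expected.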



B. Estimated accuracy versus real accuracy 


In real systems, since the real preference of users is unknown, there is no way to measure 
the real recommendation accuracy. Various metrics are thus introduced to evaluate 
recommendation accuracy. Nevertheless, whether these metrics correctly measure real accuracy is 
questionable. Since user taste and product genre are defined in our model, we can compare 
the accuracy measured by these metrics with the real accuracy. 

One common metric to evaluate recommendation accuracy is AUC, i.e. the area under 
the receiver operating characteristic (ROC) curve. When recommendations are made for user $i$, AUC is 
computed as the probability that a correct product $\alpha$ is ranked higher than an arbitrary 
product $\gamma$, given by 

$$ \mathrm{AUC}_{i\alpha} = \frac{n(r_{i\gamma} < r_{i\alpha}) + 0.5\, n(r_{i\gamma} = r_{i\alpha})}{M - k_i}, \tag{4} $$

where $n(r_{i\gamma} < r_{i\alpha})$ is the number of products with score $r_{i\gamma}$ lower than the score $r_{i\alpha}$ of 
the correct product, and $n(r_{i\gamma} = r_{i\alpha})$ is the number of items which tie with the correct 
item. Based on the definition of correct predictions, we compute two AUC measures: (i) 
the conventional estimated $\mathrm{AUC}_{\mathrm{est}}$, obtained by dividing the dataset into a training set and 
a probe set; links in the probe set are removed and are considered to be correct predictions 
if their existence is predicted; and (ii) the real $\mathrm{AUC}_{\mathrm{real}}$, which quantifies the accuracy of 
the algorithm in recommending products of a matching taste. 
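
As an illustration of the AUC definition above, the following sketch computes the value for a single correct item. It is a hedged example: the function name `auc_single` is hypothetical, and the denominator is taken to be the number of uncollected items other than the correct one, which equals $M - k_i$ only if $k_i$ counts the correct item among the user's items (for instance a removed probe link); the text does not spell out this convention.

```python
import numpy as np

def auc_single(scores, collected, correct):
    """AUC for one correct item: the share of rival items it outranks,
    with ties counted as one half.

    scores    : recommendation scores of all M items for one user
    collected : boolean mask of items already collected by the user
    correct   : index of the correct (uncollected) item
    """
    scores = np.asarray(scores, dtype=float)
    competitors = ~np.asarray(collected, dtype=bool)  # uncollected items
    competitors[correct] = False                      # exclude the correct item itself
    rivals = scores[competitors]
    lower = np.count_nonzero(rivals < scores[correct])
    ties = np.count_nonzero(rivals == scores[correct])
    return (lower + 0.5 * ties) / rivals.size

# Hypothetical scores for 5 items; item 0 is already collected, item 2 is correct.
value = auc_single([0.9, 0.2, 0.5, 0.5, 0.1],
                   [True, False, False, False, False],
                   correct=2)
```

The same function serves both measures: for the estimated AUC the correct items are the removed probe links, while for the real AUC they are the uncollected items whose genre matches the user's taste.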

The dependence of $\mathrm{AUC}_{\mathrm{est}}$ and $\mathrm{AUC}_{\mathrm{real}}$ on $f_{\mathrm{sel}}$ is shown in Fig. 3. As we can see, 
$\mathrm{AUC}_{\mathrm{real}} \approx 0.5$ when $f_{\mathrm{sel}}$ is small since recommendations are random (see Fig. 1) and the 
products of a matching taste are randomly ranked in the recommendation list. However, 
$\mathrm{AUC}_{\mathrm{est}}$ is much higher and is not consistent with $\mathrm{AUC}_{\mathrm{real}}$. The reason for a large $\mathrm{AUC}_{\mathrm{est}}$ 
at small $f_{\mathrm{sel}}$ is the frequent application of the recommender system, such that user purchases 
are strongly influenced by the algorithm regardless of their true preference. In this case, 
products which do not match their preference but are consistent with the algorithm are 
also collected by the users. This favors the evaluation by $\mathrm{AUC}_{\mathrm{est}}$ using a random probe set, 
and leads to a high $\mathrm{AUC}_{\mathrm{est}}$ even though the recommendations provided are effectively random. 

When $f_{\mathrm{sel}}$ increases, $\mathrm{AUC}_{\mathrm{est}}$ decreases since the user-product relations become less 
influenced by the recommender system. At the same time, $\mathrm{AUC}_{\mathrm{real}}$ increases since more hints 
about the user tastes are present. We remark that although $A_{\mathrm{rec}} \approx 1/G$ when $f_{\mathrm{sel}}$ is smaller 
than the threshold (see Fig. 1(a)), the corresponding $\mathrm{AUC}_{\mathrm{real}}$ is increasing in the same 
regime. Finally, $\mathrm{AUC}_{\mathrm{real}}$ and $\mathrm{AUC}_{\mathrm{est}}$ become consistent when $f_{\mathrm{sel}}$ further increases and the 
system achieves perfect recommendation. 

The above results imply that the conventional evaluation of recommendation accuracy 
may not necessarily reflect the true accuracy. Indeed, $\mathrm{AUC}_{\mathrm{est}}$ may over-estimate the 
accuracy of the algorithm, especially in cases where users rely frequently on the recommender 
system and do not reveal their own taste by deliberately selecting products. This serves as 
an alarm for researchers and operators of recommender systems. Alternative evaluations are 
therefore necessary to supplement conventional accuracy metrics to quantify the benefit of 
recommender systems for users. 

C. Users with multiple tastes 

Ordinary users usually have more than one interest; for instance, a user may be interested 
in both science fiction and action movies. To model this scenario, we assume that each user 
is characterized by two tastes, which we denote by taste 1 and taste 2. Similar to the previous 
case, a fraction $f_{\mathrm{sel}}$ of the time the user selects a product in the absence of recommender systems; 
otherwise, the recommendation algorithm is applied. When a user selects a product, a fraction $f_1$ of 
the selected products are in taste 1 and the rest are in taste 2. To simplify the model, 
we only study cases with large $f_{\mathrm{sel}}$, for which perfect recommendation is achieved in the 
original single-taste system. 

Since a fraction $f_1$ of the selected products of the user are in taste 1, the ratio $f_1/(1 - f_1)$ 
corresponds to his/her preference between the two tastes. If optimal recommendations are 
achieved, a fraction $f_1$ of the recommended products should be in taste 1 and $1 - f_1$ of them should be 
in taste 2. Nevertheless, as shown in Fig. 4, the fraction $A_{\mathrm{rec}}$ of the recommended products 
in taste 1 does not coincide with the optimal line $A_{\mathrm{rec}} = f_1$. For instance, when $f_1$ is small, 
the recommendations are mainly in taste 2. This leads to a sub-optimal state which 
under-represents the minority taste, i.e. taste 1 when $f_1 < 0.5$, among the recommended products. 
Similarly, taste 2 is under-represented when $f_1 > 0.5$. As we can see in Fig. 4, the difference 
between $A_{\mathrm{rec}}$ and $f_1$ is larger when $k$ is larger. This implies an increasing difficulty for 
the recommender system to identify a secondary taste if the user-product connections are 
denser. We remark that the results obtained with the common neighbor similarity and the 
cosine similarity are almost identical. 

On the other hand, one may expect a perfect recommendation regime at $f_1 f_{\mathrm{sel}} > f_{\mathrm{sel}}^{*}$, 
where $f_{\mathrm{sel}}^{*}$ denotes the threshold value, or equivalently the smallest $f_{\mathrm{sel}}$ at which the system 
achieves perfect recommendation in the corresponding single-taste scenario. For the system 
parameters employed in Fig. 4, $f_{\mathrm{sel}}^{*} \approx 0.73$, but perfect recommendations in taste 1 are 
not achieved with $f_1 > f_{\mathrm{sel}}^{*}/f_{\mathrm{sel}} = 0.77$ (indicated by the dotted line in Fig. 4) due to the 
presence of taste 2. 
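
As a quick check of the numbers quoted above, and assuming the value $f_{\mathrm{sel}} = 0.95$ stated in the caption of Fig. 4, the expected condition rearranges as

$$ f_1 f_{\mathrm{sel}} > f_{\mathrm{sel}}^{*} \;\Longrightarrow\; f_1 > \frac{f_{\mathrm{sel}}^{*}}{f_{\mathrm{sel}}} \approx \frac{0.73}{0.95} \approx 0.77 . $$

Here $f_1 f_{\mathrm{sel}}$ is read as the fraction of all updates that are deliberate selections in taste 1; this reading is an interpretation of the condition rather than a statement taken from the text.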


D. Tests with empirical datasets 

Finally, we incorporate our model with a real dataset obtained from MovieLens [16]. 
Since user taste and product genre are unknown in real systems, we again randomly divide 
the dataset into a training set and a probe set, and consider a recommended movie to 
be correct only if it was collected by the user and received a rating of 3 or above (on a scale from 1 
to 5) from the user as recorded in the data. Similar to our model, a fraction $f_{\mathrm{sel}}$ of the time 
a user deliberately selects a correct movie, and otherwise the recommendation algorithm is 
applied. For those users who rated at least two movies with a score of 3 or above, we set 
their degree to be $k_i - 1$ such that an un-collected correct movie always exists. As in the 
previous simulations, a user randomly removes one of his/her collected movies when he/she 
obtains a new movie; the system is then repeatedly updated. 

As shown in Fig. 5(a), the accuracy $A_{\mathrm{rec}}$ obtained by both similarity definitions starts at 
a low value and increases with $f_{\mathrm{sel}}$. Nevertheless, it does not show an abrupt jump to a high 
value as in the previous simulations; instead, a plateau at small $f_{\mathrm{sel}}$ and a small jump at large $f_{\mathrm{sel}}$ 
are observed in the case with cosine similarity. These results again suggest that sufficient 
hints about user taste are essential for the system to obtain accurate recommendations. 
When $f_{\mathrm{sel}}$ approaches 1, $A_{\mathrm{rec}}$ decreases since users have collected most of the correct movies 
through deliberate selection and it becomes more difficult for the recommender system to 
identify the fewer remaining correct items among all the other items. 

As shown in Fig. 5(b), the dependence of $\mathrm{AUC}_{\mathrm{est}}$ and $\mathrm{AUC}_{\mathrm{real}}$ on $f_{\mathrm{sel}}$ is similar to that 
observed in the previous simulations. When $f_{\mathrm{sel}}$ is small, the conventional AUC metric 
over-estimates the accuracy of the recommender system. In particular, $\mathrm{AUC}_{\mathrm{est}}$ is highest when 
$\mathrm{AUC}_{\mathrm{real}}$ is lowest, and $\mathrm{AUC}_{\mathrm{est}} = \mathrm{AUC}_{\mathrm{real}}$ only when $f_{\mathrm{sel}} = 1$. This suggests that conventional 
metrics may again over-estimate recommendation accuracy in real systems. 


E. A recommendation algorithm with improvement 


Based on the previous results, we slightly modify the ICF algorithm to improve the 
recommendation accuracy. The rationale is simple: since products deliberately selected by 
users usually match their taste, we simply give a higher weight to these products during 
the computation of recommendation scores, by modifying the adjacency variable $a_{i\alpha}(t)$ as 
follows: 

$$ a_{i\alpha}(t) = \begin{cases} 0 & \text{if } \alpha \notin C_i(t), \\ 1 & \text{if } \alpha \in C_i(t) \text{ via recommendation}, \\ b & \text{if } \alpha \in C_i(t) \text{ via selection}, \end{cases} \tag{5} $$


where $C_i(t)$ is again the set of products collected by user $i$ at time $t$, and $b > 1$ is the bias 
on products collected via deliberate selection. The recommendation score of an item is 
then computed by the same formula, Eq. (3). The recommendation accuracy obtained by 
the modified algorithm is compared to that of the original algorithm in Fig. 6. As we can 
see from Fig. 6(a) and (b), perfect recommendations are achieved at a smaller $f_{\mathrm{sel}}$ when 
selected products are weighted more heavily in the algorithm. Similar results are observed with 
the MovieLens dataset as shown in Fig. 6(c) and (d). These results imply that products 
deliberately chosen by users are essential information for improving recommendation accuracy. 
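
The weighting scheme of this subsection can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and it assumes that the weighted adjacency variable enters both the common neighbor similarity and the score sum, which the text implies but does not state explicitly.

```python
import numpy as np

def biased_adjacency(collected, selected, b=2.0):
    """Weighted adjacency: 0 if not collected, 1 if collected via
    recommendation, b if collected via deliberate selection."""
    collected = np.asarray(collected, dtype=bool)
    selected = np.asarray(selected, dtype=bool) & collected
    a = collected.astype(float)
    a[selected] = b
    return a

def biased_scores(a, user):
    """Recommendation scores using the weighted adjacency matrix.
    Assumption: the weights enter both the CN similarity and the score sum."""
    s = a.T @ a                    # common neighbor similarity with weights
    r = s @ a[user]
    r[a[user] > 0] = -np.inf       # exclude already collected items
    return r

# Hypothetical data: 3 users, 4 items; `selected` marks deliberate choices.
collected = np.array([[1, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1]])
selected = np.array([[1, 0, 0, 0],
                     [1, 0, 1, 0],
                     [0, 0, 0, 0]])
a = biased_adjacency(collected, selected, b=2.0)
r = biased_scores(a, user=0)
```

In this toy example the unweighted scores of items 2 and 3 for user 0 are tied, whereas with $b = 2$ item 2, which is linked to user 0 through deliberately selected items, receives the higher score.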


Discussion 

To reveal the benefit of recommender systems for users, we studied a simple model where 
users either choose their own products or follow the recommendations from the system. Our 
results show that the recommendations may be equivalent to random draws if users rely 
too strongly on the recommender system and do not reveal their own taste by deliberately 
selecting products. On the other hand, if sufficient information about their taste is present, 
recommendation systems are able to achieve high accuracy in matching appropriate products 
to users. For some recommendation algorithms, the increase in accuracy is abrupt once the 
amount of available information exceeds a threshold. These results imply that recommender 
systems can benefit users, but relying too strongly on the system may render the system 
ineffective. 

On the other hand, our study reveals the difficulty of obtaining a realistic and accurate 
evaluation of recommendation accuracy. Since real user preference is unknown, evaluation of 
recommender algorithms usually involves removing a set of existing data and quantifying the 
accuracy by the success in retrieving the removed set. Our results show that such metrics 
do not necessarily reflect, and may over-estimate, the true accuracy of the algorithm. This 
is because the choice of products collected by users was previously influenced by the 
recommendation algorithms; the presence of these products may not reflect their true preference 
and may favor the evaluation by the conventional accuracy metrics. The disagreement 
between the estimated and the real accuracy was observed in simulations with both generated 
networks and a real dataset. These results imply that a high recommendation accuracy 
indicated by the conventional metrics may not necessarily imply a benefit for users. Alternative 
evaluations are necessary to supplement these metrics in order to quantify the effectiveness 
of the recommender systems. 

FIG. 1: The accuracy of the recommender system as a function of $f_{\mathrm{sel}}$ for different numbers of 
taste groups $G$. The simulation results were obtained with $N = 2000$ users and $M = 100$ products. 
Each user collects $k = 1$ products and is updated $1 \times 10^{?}$ times. Each data point was averaged 
over 50 instances. The common neighbor similarity Eq. (1) and the cosine similarity Eq. (2) were 
employed in (a) and (b) respectively. 

FIG. 2: The accuracy $A_{\mathrm{rec}}$ of the recommender system as a function of $f_{\mathrm{sel}}$ for different values of $k$, 
the number of products collected per user. The simulation results were obtained with $N = 2000$, 
$M = 100$ and $G = 10$. Each user was updated $1 \times 10^{?}$ times, and each data point was averaged 
over 50 instances. The common neighbor similarity Eq. (1) and the cosine similarity Eq. (2) were 
employed in (a) and (b) respectively. 

FIG. 3: The two different AUC measures, $\mathrm{AUC}_{\mathrm{est}}$ and $\mathrm{AUC}_{\mathrm{real}}$, as a function of $f_{\mathrm{sel}}$, obtained 
by ICF with common neighbor similarity and cosine similarity (inset) on systems with $N = 2000$, 
$M = 100$, $k = 3$ and $G = 10$. 

FIG. 4: The fraction $A_{\mathrm{rec}}$ of recommended items in taste 1 as a function of $f_1$, the fraction of the 
selected products in taste 1. The simulations are obtained with $N = 2000$, $M = 100$, $G = 10$ and 
$f_{\mathrm{sel}} = 0.95$ for $5 \times 10^{?} N$ updates averaged over 50 instances. Only results obtained with common 
neighbor similarity are shown. 










FIG. 5: (a) The recommendation accuracy as a function of $f_{\mathrm{sel}}$, obtained by incorporating our 
model with the MovieLens dataset with 944 users and 1683 products, and 5000 updates per user. 
(b) The corresponding estimated $\mathrm{AUC}_{\mathrm{est}}$ and the real $\mathrm{AUC}_{\mathrm{real}}$ as a function of $f_{\mathrm{sel}}$. 

FIG. 6: The accuracy $A_{\mathrm{rec}}$ of the original ICF compared with ICF biased on products collected via 
deliberate selection (with $b = 2$ in Eq. (5)). The results are obtained by (a) the common neighbor 
(CN) similarity and (b) the cosine similarity on generated networks with $N = 2000$, $M = 100$, 
$k = 7$ and $G = 10$. The corresponding results on the MovieLens dataset are shown in (c) and (d). 






























